We present results from our quantitative study of statistical and network properties
of literary and scientific texts written in two languages: English and Polish. We show that
Polish texts are described by the Zipf law with a scaling exponent smaller than the
one for the English language. We also show that the scientific texts are typically
characterized by rank-frequency plots with a relatively short range of power-law behavior as
compared to the literary texts. We then transform the texts into their word-adjacency
network representations and find another difference between the languages. For the
majority of the literary texts in both languages, the corresponding networks reveal a
scale-free structure, while this is not always the case for the scientific texts. However,
all the network representations of texts are hierarchical, and in this respect we do not
observe any qualitative or quantitative difference between the languages. If we look at other
network statistics like the clustering coefficient and the average shortest path length, the
English texts appear to possess a more clustered structure than the Polish ones. This
result is attributed to differences in the grammar of the two languages, which are also
indicated in the Zipf plots. All the texts, however, show a network structure that differs from
any of the Watts-Strogatz, the Barabási-Albert, and the Erdős-Rényi architectures.
Keywords: Complex networks; Zipf law; Natural language 
PACS Nos.: 64.60.aq, 89.75.Fb

1. Introduction 

Natural language is an evolving system whose present structure can doubtlessly
be considered a product of a long history of self-organization. As for many other
self-organized systems known in Nature, the observables associated with language,
for example, written texts or spoken messages, reveal quite sophisticated
dynamics. A language sample is by no means an amorphous mixture of symbols
(letters, phonemes, morphemes, words, etc.) but rather a highly organized sequence
in which particular symbols are ordered according to specific rules, most of which are
defined by the language grammar. Since the existence of grammar is an emergent
phenomenon, language can be counted among the complex systems. The
grammatical rules together with the information content impose on the language
elements relations which can be most easily expressed in the form of a network where,
for instance, words are represented by nodes and their relations by edges. Some earlier
attempts in this direction were presented for English, Portuguese,
and Chinese. Here we show a few results that were obtained for English and Polish
and for different types of texts (literary or scientific).

2. Methods and data 

Our analysis was based on text samples written in two languages: English and
Polish. Both belong to the Indo-European family, but to different groups: the West
Germanic and the West Slavic group, respectively. Their grammars therefore differ
significantly, most notably in the rich inflection of words in Polish
as compared to a rather residual one in English. However, in the present work
we do not deal with semantic analysis, but restrict our study to a statistical
analysis of word adjacency. As regards the English part, we analyze two groups of
texts. The first one comprises literary texts represented by 9 works of prose
("Ulysses" and "Finnegans Wake" by J. Joyce, "Alice's Adventures in Wonderland"
by L. Carroll, "Adventures of Huckleberry Finn" by M. Twain, "Pride and
Prejudice" by J. Austen, "Oliver Twist" by C. Dickens, "Secret Adversary" by
A. Christie, "Adventures of Sherlock Holmes" and "Study in Scarlet" by A. Conan
Doyle), 4 dramas by W. Shakespeare ("Hamlet", "Macbeth", "The Winter's Tale",
and "Romeo and Juliet"), 61 poems by O. Wilde, and 25 poems by T.S. Eliot.
The second group comprises scientific texts represented by selected works of
E. Witten, E.G.D. Cohen, S. Weinberg, and P.W. Anderson, as well
as by three long reviews by D. Sornette, R. Albert and A.-L. Barabási, and
J. Kwapień and S. Drożdż. Somewhere at the interface of these two groups, there
are "A Brief History of Time" by S. Hawking and "The Emperor's New Mind:
Concerning Computers, Minds, and the Laws of Physics" by R. Penrose, representing
popular science. The Polish language was represented by the novels: "Lalka" ("The
Doll") by B. Prus, "Bramy Raju" ("The Gates of Paradise") by J. Andrzejewski,
"Dolina Issy" ("The Issa Valley") by C. Miłosz, "Cesarz" ("The Emperor: Downfall
of an Autocrat") by R. Kapuściński, "Dzienniki gwiazdowe" ("The Star Diaries") by
S. Lem, "Ferdydurke" by W. Gombrowicz, the epic "Pan Tadeusz" ("Sir Thaddeus")
by A. Mickiewicz, the only Polish translation of "Ulysses", done by M. Słomczyński,
35 poems by C. Miłosz, and 99 poems by W. Szymborska.

All the texts were filtered in order to remove some of the punctuation marks 
(all except the ones that can functionally end sentences: the periods, the colons and 
semicolons, the question and exclamation marks) as well as the non-word sequences 
like numbers. The so-preprocessed texts were subject to further analysis. 
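
The filtering step can be illustrated with a short Python sketch. It is not the code used in the study: the function name `preprocess`, the lowercasing of words, and the handling of apostrophes and hyphens inside words are assumptions made only for illustration.

```python
import re
import string

# Marks that can functionally end a sentence; they are used as delimiters.
SENTENCE_END = ".:;?!"
# Every other ASCII punctuation mark is stripped from the edges of a token.
STRIP_CHARS = "".join(c for c in string.punctuation if c not in SENTENCE_END)


def preprocess(text):
    """Split raw text into sentences, each a list of lowercase words.

    Assumptions (not taken from the paper): words are lowercased, and a
    token is kept only if it consists of letters, apostrophes, and hyphens.
    """
    sentences = []
    for chunk in re.split("[" + re.escape(SENTENCE_END) + "]+", text):
        words = []
        for token in chunk.split():
            token = token.strip(STRIP_CHARS).lower()
            # Drop empty tokens and non-word sequences such as numbers.
            if token and all(ch.isalpha() or ch in "'-" for ch in token):
                words.append(token)
        if words:
            sentences.append(words)
    return sentences
```

Keeping the sentence boundaries in the output (a list of lists) makes it straightforward to restrict later steps to pairs of words within one sentence.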

3. Results 

Although language and language samples are traditionally subject to purely qualitative
analysis in the humanities, language samples consist of symbolic
sequences, which can easily be subjected to quantitative analysis. Historically, the
beginning of quantitative analysis of natural language is usually associated with
the name of G.K. Zipf, who was the first to carry out an extensive study of word
frequencies in written texts in a few different languages, even though
he had known predecessors like J.-B. Estoup and E.L. Thorndike [20], who
did some research in the same direction, but far less extensive than Zipf's and
without any significant impact on science. The main result attributed to Zipf is his
eponymous law stating that the number F of occurrences of a word in a text sample
is, roughly, inversely proportional to its rank R, where the words are ranked by
decreasing frequency. For the English language, it is:

\[
F(R) = \frac{A}{R^{\alpha}}, \qquad \alpha \approx 1, \tag{1}
\]

while for other languages $\alpha$ can also be slightly smaller or larger than 1. Here A stands
for an empirical proportionality constant equal to 0.1T, where T is the total
number of words in a sample (the sample's length). For single text samples like
books, the above power-law relation is usually well preserved for medium ranks
(e.g., 10 < R < 1000) in the majority of cases, but for the lowest and the highest
ranks it breaks down, leading to a flattening of F(R) for the most frequent words and
to its faster decline for the least frequent ones. As regards the multi-piece unions
of text samples (corpora), the scaling (1) holds up to ranks of a few thousand,
and above them another scaling region with a larger value of $\alpha$ appears [21,22].
The latter phenomenon can be interpreted as a division of the vocabulary into two
sections: the basic vocabulary containing words that are shared by almost all the
text samples, thus common to all people, and the specialist vocabulary that can
be subject-specific and author-specific. An interesting property of the Zipf law
is its similarity to a number of other relations that can be found in different and
sometimes very distant fields: city population, scientific paper citations, earthquake
magnitude, and many more.

Fig. 1. Rank-frequency distribution of words F(R) in the English texts of (a) "Ulysses" by
J. Joyce, (b) "A Brief History of Time" by S. Hawking, (c) "Finnegans Wake" by J. Joyce, and (d)
in the Polish text of "Pan Tadeusz" ("Sir Thaddeus") by A. Mickiewicz. The empirical distributions
are compared with the corresponding best fits in terms of a power-law function (Eq. (1)) with the
scaling exponent $\alpha$.

As regards the rank-frequency relation for text samples, Fig. 1(a) shows such a
plot for "Ulysses". It is notable for its uniquely broad range of ranks (3 decades) over
which a power-law scaling holds, which is very rarely equalled by other
texts. For example, a similar plot for "A Brief History of Time" in Fig. 1(b)
reveals scaling valid for only 2 decades. This means that the vocabulary volume of
this book is smaller than would be expected from the power-law relation holding
over all the ranks. On the contrary, "Finnegans Wake" (Fig. 1(c)) possesses an extremely
diverse vocabulary and the rarest words are overrepresented, which breaks
the Zipf-like relation in the opposite direction as compared to Hawking's book.
Such a situation does not surprise us, however, since "Finnegans Wake" is known
to be a highly experimental piece of text comprising words from many languages.
Next, Fig. 1(d) shows a rank-frequency plot obtained for the Polish text of "Pan
Tadeusz". A well-fitted power-law function with $\alpha = 0.84$ describes the plot, whose
slope is much smaller than for typical English texts and even for typical Polish ones,
while it is characteristic of the works of A. Mickiewicz.
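
A rank-frequency distribution and an estimate of the scaling exponent can be obtained along the following lines. This is only a sketch: the text does not state how the best fits were computed, so the ordinary least-squares fit in log-log coordinates, the function names, and the default rank window (taken from the "medium ranks" example 10 < R < 1000) are assumptions.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return word counts in decreasing order; entry r - 1 is F(R = r)."""
    return sorted(Counter(words).values(), reverse=True)


def fit_zipf_exponent(freqs, r_min=10, r_max=1000):
    """Fit F(R) = A / R**alpha on r_min <= R <= r_max.

    Uses a least-squares line through (log R, log F); this fitting method
    is an assumption, not necessarily the one used for Fig. 1.
    Returns the pair (alpha, A).
    """
    pts = [(math.log(r), math.log(f))
           for r, f in enumerate(freqs, start=1)
           if r_min <= r <= r_max]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    # The log-log slope is -alpha; the intercept gives log A.
    return -slope, math.exp(mean_y - slope * mean_x)
```

Restricting the fit to a rank window matters because, as described above, the power law typically breaks down for the lowest and the highest ranks.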

Characterization of a text sample by means of the Zipf plot is informative
with respect to the vocabulary volume of an author and the mutual relations between the most
common and other words, but it is insensitive to any kind of correlations possibly
present in the sample. In order to incorporate correlations into our analysis, we
create network representations of each of the text samples studied here. We choose
a representation in which different words are regarded as different network
nodes. One type of interesting correlation that can be quantified in this way is the
adjacency relation between pairs of words. Two words are considered related and
their nodes linked by an edge if they are nearest neighbours at least once in
a sample. For a given word, there are two possible relations with its neighbours:
precedence and succession. The former holds when a neighbour precedes the considered
word, the latter when it follows it. In this context, we study two networks:
the precedence (left-side neighbourhood) network and the succession (right-side
neighbourhood) network. We decided to consider only the neighbours that belong to
the same sentence and neglect the inter-sentence adjacency. This does not influence
our results, however: a preliminary analysis carried out on a few text samples showed
that there is no qualitative difference in the results between these cases. This, of
course, might be a consequence of the much smaller number of such inter-sentence
pairs (roughly, less than 10% of all pairs).

Fig. 2. Binary network representation (left) and minimal spanning tree (right) of an exemplary
text sample: "Statistical mechanics of complex networks". Both pictures correspond to the
succession network.

By construction, our networks can be either binary or weighted. In the former
case, we consider two nodes to be linked by an edge if the respective words are
neighbours at least once in a text, but we do not pay attention to how many times
they neighbour each other. In contrast, in the latter case, we count the number
of such occurrences and attribute a corresponding weight to each edge. Examples
of both situations are presented in Fig. 2, where a binary network (left) and a
minimal spanning tree calculated from a weighted network (right) are created for
an exemplary piece of text.
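
The construction of the two networks can be sketched as follows. The data layout (a dictionary of `Counter` objects), the function names, and the reading of a node's degree as its number of distinct right-side or left-side neighbours are illustrative assumptions, not a description of the original implementation.

```python
from collections import Counter, defaultdict


def adjacency_networks(sentences):
    """Build weighted succession and precedence networks.

    `sentences` is a list of word lists. Only pairs of words adjacent
    within the same sentence are linked. Returns (succ, pred), where
    succ[w][v] counts how often v directly follows w and pred[w][v]
    counts how often v directly precedes w.
    """
    succ = defaultdict(Counter)
    pred = defaultdict(Counter)
    for words in sentences:
        for left, right in zip(words, words[1:]):
            succ[left][right] += 1
            pred[right][left] += 1
    return succ, pred


def binary_degrees(net):
    """Degrees in the binary version of a network: distinct neighbours only.

    Words without any neighbour on the relevant side do not appear as
    keys and therefore have an implicit degree of 0.
    """
    return {word: len(neighbours) for word, neighbours in net.items()}
```

The edge weights are kept in the same structure, so the binary network is obtained simply by ignoring the counts.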

To begin with, let us calculate a cumulative distribution P(X > k) of node
degrees k for different texts. This is one of the most informative quantities since it
allows one to detect a hierarchical and scale-free structure of a given network.
Since such distributions for weighted networks carry basically the same information
as the Zipf plots, we restrict our analysis to the binary networks only. Fig. 3 exhibits
cumulative distributions P(X > k) for selected literary texts in English. Interestingly,
although there are clear differences between the distributions for different
texts, all the texts studied (including others not shown here) reveal a scale-free
or almost scale-free dependence for some range of k, with the scaling exponents
$1 < \beta < 2$ being in agreement with the results from other studies of different
systems [28,29]. It happens for some texts that the precedence and the succession
networks visibly differ from each other. We do not inspect this issue in more detail,
but the source of this difference might be either author-specific or text-specific.

Fig. 3. Cumulative distributions P(X > k) of the node degrees k for the word-adjacency network
representations of English literary texts: (a) "Ulysses", (b) "Alice's Adventures in Wonderland",
(c) "Adventures of Huckleberry Finn", (d) "Pride and Prejudice", (e) 4 Shakespeare dramas, and
(f) 25 poems by T.S. Eliot. The precedence and succession networks are shown simultaneously
in each panel.

Fig. 4. Cumulative distributions P(X > k) of the node degrees k for the word-adjacency network
representations of Polish literary texts: (a) "Lalka", (b) "Bramy Raju", (c) "Pan Tadeusz", (d) a
Polish translation of "Ulysses", (e) 99 poems by W. Szymborska, and (f) 35 poems by C. Miłosz.
The precedence and succession networks are shown simultaneously in each panel.

In Fig. 4, the P(X > k) distributions are shown for selected Polish literary
texts. For prose ((a)-(d)), the scale-free slopes of these distributions are even better
visible than for the English texts in Fig. 3. The same applies to poems, except for the
overrepresentation of nodes with small k in the case of Polish poetry (Fig. 4(e)-(f)).
This overrepresentation probably stems from the fact that poetry, which needs
a specific rhythm, imposes strong restrictions on the words that can be used in
particular places.
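
For reference, a cumulative degree distribution of the kind plotted in Figs. 3 and 4 can be computed as in the sketch below. The function name is hypothetical, and the strict inequality follows the notation P(X > k) used in the text; whether the original plots use a strict or a non-strict inequality is not stated.

```python
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return {k: P(X > k)} for every degree k observed in `degrees`.

    `degrees` is an iterable with one entry per node.
    """
    counts = Counter(degrees)
    n_nodes = sum(counts.values())
    ccdf = {}
    above = 0  # number of nodes with degree strictly larger than k
    for k in sorted(counts, reverse=True):
        ccdf[k] = above / n_nodes
        above += counts[k]
    return ccdf
```

With the strict inequality the largest observed degree gets probability 0, so that point has to be dropped before plotting on logarithmic axes.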

In order to compare statistical properties of node degrees for different types of
texts and the two languages, we create five separate corpora from the texts considered
in this work, containing English prose, English poetry, English scientific texts,
Polish prose, and Polish poetry. Then for each corpus, we calculate a node degree
cumulative distribution P(X > k). Fig. 5 shows these distributions along with their
power-law slopes. Regarding prose, the distribution for the English language has
a smaller slope than its counterpart for the Polish language. However, the picture for
poetry looks different: the node degree distribution is steeper in the case of English.
Roughly, as regards the corpora of Polish prose and of Polish poetry, the distributions
look similar, which is not the case for their English counterparts. P(X > k)
for the corpus of scientific texts written in English reveals the steepest slope,
with $\beta = 1.31$. This is not surprising, however, since many scientific texts are full
of mathematics and related formal names and expressions, which make the vocabulary
poorer than in the case of literary works, which do not have any vocabulary
restrictions (see Fig. 5(a)). As regards P(X > k) for the individual scientific papers,
some of them do not reveal any trace of scaling while others are clearly scale-free.
This strongly depends on the relative amounts of standard description and strict
mathematical language: the less mathematics there is, the better the scaling that can be
observed.

Fig. 5. Cumulative distributions P(X > k) of the node degrees k for the word-adjacency network
representations of English (a) and Polish (b) corpora containing prose, poetry or scientific texts.
Slopes of the best-fitted power laws are indicated by the dashed lines and the values of the scaling
exponent $\beta$.

Fig. 6. Visualization of a network representation of an exemplary text sample showing a hierarchy
of nodes.

Our results for both languages indicate that the adjacency networks in both
representations are strongly non-democratic with a clear hierarchy of hubs. Indeed,
Fig. 6 confirms this conclusion by showing both global and local hubs with large
values of k surrounded by clouds of peripheral nodes with $k \approx 1$. Other topological
properties of the networks can be characterized by their spatial extension and the
inclination of nodes to form clusters. The former can be quantitatively described by the
average characteristic path length L expressing the average node-to-node distance.
For a binary network it is defined by:

\[
L = \frac{1}{N(N-1)} \sum_{i \neq j} L_{ij}, \tag{2}
\]

where $L_{ij}$ is the length of the shortest path connecting the nodes i and j (the length
of each path is defined as the number of edges this path passes through). The magnitude
of L and, especially, its dependence on N differ between network types.
L typically grows fast for regular lattices and chains ($L \sim N$), moderately fast for
random networks of the Erdős-Rényi type ($L \sim \ln N$), the small-world networks
($L \lesssim \ln N$), and the Barabási-Albert networks ($L \sim \ln N / \ln\ln N$), slowly for
the ultrasmall networks ($L \sim \ln\ln N$), while for densely connected networks it
can be roughly independent of N. For our networks, values of this quantity (for
the complete texts) belong to the interval 2.7 < L < 3.8, which generally falls
into the small-world regime. However, the asymptotic behavior of L(N)
is significantly different from the small-world one, since for $N \gg 1$, L is a decreasing
function of N (Fig. 7). This is because, for large N, the vocabulary volume V
(here, $V \equiv N$) used for writing a text grows much more slowly than the length T
of the text (which is expressed by a general relation $V \sim T^{\delta}$, where $0.4 < \delta < 0.6$,
the so-called Heaps law), and this leads to an increasing density of edges.

Fig. 7. The average shortest path length L as a function of the number of nodes N for exemplary
texts considered in this work (only N < 3000 is shown because the texts differ in their vocabulary
volume). The inset shows a magnification of the small-N region.
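
The average shortest path length of Eq. (2) can be evaluated with a breadth-first search from every node, as in the sketch below. Two assumptions are made for illustration: the network is treated as undirected, and self-loops (a word adjacent to itself) are discarded. The function names are hypothetical.

```python
from collections import deque


def to_undirected(edges):
    """Turn an iterable of (u, v) word pairs into an adjacency dict of sets."""
    adj = {}
    for u, v in edges:
        if u != v:  # self-loops are ignored (an assumption)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def average_path_length(adj):
    """Mean shortest-path length over ordered pairs of distinct nodes.

    For a connected network this equals Eq. (2). If the network is
    disconnected, only reachable pairs enter the average.
    """
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    if pairs == 0:
        raise ValueError("need at least two connected nodes")
    return total / pairs
```

The cost is one breadth-first search per node, which is manageable for vocabularies of the size discussed here.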
Fig. 8. The clustering coefficient C vs. the average shortest path length L / log N for the texts
considered in this work. Different groups of texts are denoted by different symbols.



The clustering coefficient C for an undirected binary network is expressed by

\[
C = \frac{1}{N} \sum_{i} \frac{\sum_{j,m} a_{ij}\, a_{jm}\, a_{mi}}{k_i (k_i - 1)}, \tag{3}
\]

where $a_{pq}$ are binary edges (equal to 1 for an existing edge and 0 otherwise) and
$k_i$ is the degree of node i.
Its value is typically small for the Erdős-Rényi networks ($C \sim N^{-1}$) and for the
Barabási-Albert networks ($C \sim N^{-0.75}$), while it is large (and independent of
N) for the small-world networks of the Watts-Strogatz type.
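
A direct evaluation of the clustering coefficient in Eq. (3) is sketched below. It assumes an undirected adjacency dictionary of sets without self-loops, and it lets nodes with fewer than two neighbours contribute zero, which is a common convention but an assumption here.

```python
def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected binary network.

    `adj` maps every node to the set of its neighbours (no self-loops).
    Nodes with degree below 2 contribute 0 (an assumed convention).
    """
    total = 0.0
    for i, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Number of ordered pairs (j, m) with a_ij = a_jm = a_mi = 1,
        # i.e. twice the number of triangles passing through node i.
        closed = sum(1 for j in neighbours for m in adj[j] if m in neighbours)
        total += closed / (k * (k - 1))
    return total / len(adj)
```

For a triangle with one extra node attached to a single corner, the local values are 1/3, 1, 1, and 0, which gives C = 7/12.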

For all the texts except the highly mathematical papers by E. Witten and
S. Weinberg, values of C are in the interval 0.09 < C < 0.34, but unlike for the
shortest path length, here we observe a rather clear separation between Polish and
English novels (including Shakespeare's dramas): 0.10 < C < 0.20 for the Polish
ones and 0.22 < C < 0.34 for the English ones. The poetry in both languages
tends to have a smaller clustering coefficient than any other considered piece of text:
$C_{\rm PL} = 0.09$ and $C_{\rm EN} = 0.18$. These results can be seen in the scatter plot in Fig. 8,
which shows values of the clustering coefficient and the shortest path length for
all the texts considered in this work. Indeed, the Polish and the English texts
occupy different regions of the (C, L) plane, with the Polish ones being characterized by
smaller C. For English, one can also see that scientific texts may have different
properties than the literary texts, especially if they contain much mathematics (Witten,
Weinberg). If they are more descriptive than mathematical, their properties can
resemble the properties of literary texts (Sornette, Hawking, Penrose).

It is interesting to note that unlike the other network types mentioned above, for our
adjacency networks, C is an increasing function of N and, typically, its dependence
is either power-law, $C \sim N^{\gamma_N}$ (at least for large values of N), with the scaling
index $\gamma_N < 1$, or not far from power-law. A similar power-law behavior can be
seen for C(T), where T is the text length, but this is not surprising due to the
already-mentioned Heaps law: $N \sim T^{\delta}$. Fig. 9(a) shows two examples of such
power-law behavior for one English and one Polish text. As was the case with C,
the English and the Polish texts have distinct values of $\gamma_T$, with those for the
Polish texts being significantly larger. Calculated values of this index for the texts that show a clear
power-law dependence of C(T) are collected in Fig. 9(b). Although $\gamma_T$ cannot be
estimated for all the texts from Fig. 8, the separation of the languages is clear
also here.

Fig. 9. (a) The dependence of the clustering coefficient C on the number of nodes N and the text
length T for two exemplary texts with the observed scaling behavior $C \sim N^{\gamma_N}$ or $C \sim T^{\gamma_T}$. The
scaling indices $\gamma_N$ and $\gamma_T$ are shown together with the corresponding least-squares fits. (b) The
scaling index $\gamma_T$ vs. the average shortest path length L for those texts for which C(T) was at least
partially power-law. Different groups of texts are denoted by different symbols.
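
The link between the two scaling indices can be made explicit. The short derivation below is a sketch that assumes both power laws hold over the same range of text lengths; the symbol $\delta$ for the Heaps exponent is a notational assumption.

```latex
\[
C \sim N^{\gamma_N}, \quad N \sim T^{\delta}
\;\;\Longrightarrow\;\;
C \sim \left(T^{\delta}\right)^{\gamma_N} = T^{\delta\,\gamma_N},
\qquad \text{so} \qquad \gamma_T = \delta\,\gamma_N .
\]
```

Under these assumptions, $0.4 < \delta < 0.6$ together with $0 < \gamma_N < 1$ would imply $\gamma_T < 0.6$ wherever both scalings apply.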

The observed behavior of L(N) and C(N) suggests that the networks considered
in this work have their own specific structure that clearly distinguishes them from
the most well-known network structures. They cannot be counted as small-world
networks even though the average characteristic path length is relatively short.
They also differ from the Barabási-Albert networks, despite the fact that some of
them show a scale-free structure, and from the random (Erdős-Rényi) ones.

4. Conclusions 

We presented several results from our quantitative study of statistical and network
properties of literary and scientific texts written in two substantially different
languages: English and Polish. We transformed the text samples into word-adjacency
networks defined by the nodes representing individual words and the edges
representing pairs of directly neighbouring words. For the majority of the studied literary
texts in both languages, the corresponding networks revealed a scale-free structure,
while this was rarely the case for the scientific texts. We also showed that there
are differences in node degree distributions between prose and poetry, especially in
English. Poetry has a detectably steeper slope of the distribution than prose. The
slope for scientific texts is even steeper than for poetry, but this can be explained
by the typically poorer vocabulary in the former case. Despite these differences, all the
network representations of texts were hierarchical with a few important hubs and
a majority of less important nodes. No qualitative or quantitative difference
between the languages was noticed in this respect. This picture changed completely
when we looked at other network statistics like the clustering coefficient and the
average shortest path length. The English texts appear to possess a more clustered
structure, while the Polish ones were less clustered. This result was attributed to
differences in the grammar of the two languages, which were also indicated in the Zipf plots.
Our results suggest that the word-adjacency networks cannot fully be described by
any of the Erdős-Rényi, the Watts-Strogatz, and the Barabási-Albert models, even
though these networks exhibit certain characteristics of the latter two models. Such
networks will be a subject of our forthcoming study.